Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

Sequencing and Raw Sequence Data Quality Control ◾ 37

Figure 1.29 shows the deterioration in base quality toward the ends of the reads that is common with short-read sequencing instruments. We can also notice that the quality scores at some positions are as low as Phred 2 (an error probability of about 0.63).
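A Phred score Q corresponds to an error probability P = 10^(-Q/10). The following is a minimal sketch of that conversion in the shell using awk; the function name "phred_to_prob" is a hypothetical helper introduced here for illustration and is not part of FastQC or any other tool used in this chapter.

```shell
# Hypothetical helper: convert a Phred quality score Q into the
# corresponding base-calling error probability P = 10^(-Q/10).
phred_to_prob() {
  awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

phred_to_prob 2     # prints 0.6310
phred_to_prob 28    # prints 0.0016
```

A base with Phred 2 is therefore more likely to be wrong than right, whereas a base with Phred 28 is miscalled roughly 16 times in 10,000.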

The report shows three failures and a single warning: per base sequence quality failed (Figure 1.29), per base sequence content and k-mer content failed (Figure 1.30), and overrepresented sequences raised a warning (Figure 1.31).

QC processing strategies differ from one FASTQ file to another, depending on which metrics failed. Understanding the problem always gives a good idea of which kinds of QC processing to perform. In our example file, we will begin by filtering out the low-quality reads and clipping the overrepresented sequences, and then we will run FastQC again to see how the quality has improved.
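To make "filtering the low-quality reads" concrete, here is a minimal awk sketch of a per-read rule of the kind applied in the next step: a read is kept when at least a minimum percentage of its bases reach a minimum Phred score. It is an illustration only, not the implementation of any existing tool. The function name "keep_good_reads" is hypothetical, and the sketch assumes Phred+33 quality encoding and exactly four lines per FASTQ record.

```shell
# Hypothetical helper (illustration only): keep reads in which at least
# MIN_PCT percent of the bases have a Phred quality score >= MIN_Q.
# Assumes Phred+33 encoding and four-line FASTQ records.
# Usage: keep_good_reads MIN_Q MIN_PCT < input.fastq > output.fastq
keep_good_reads() {
  awk -v min_q="${1:-28}" -v min_pct="${2:-80}" '
    # Build a character-to-ASCII-code lookup table for quality symbols.
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
    # Buffer the current record: header, bases, separator, qualities.
    { rec[NR % 4] = $0 }
    # The fourth line of each record is the quality string.
    NR % 4 == 0 {
      n = length($0); good = 0
      for (i = 1; i <= n; i++)
        if (ord[substr($0, i, 1)] - 33 >= min_q) good++
      if (n > 0 && good * 100 >= min_pct * n)
        printf "%s\n%s\n%s\n%s\n", rec[1], rec[2], rec[3], rec[0]
    }
  '
}
```

For example, "keep_good_reads 28 80 < bad.fastq" would write to standard output only the reads in which at least 80% of the bases have a quality score of 28 or higher.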

First, we will try to fix the per base sequence quality of the reads in the FASTQ file by using “fastq_quality_filter” to keep only the reads in which at least 80% of the bases have quality scores equal to or greater than 28. The following script performs the filtering (the output file is “filtered.fastq”), runs FastQC to generate the new QC report, and then runs Firefox to display the QC report in the web browser:

FIGURE 1.30 Failed per base sequence content and k-mer content.

FIGURE 1.31 Overrepresented sequences that raised a warning.

fastq_quality_filter \
-i bad.fastq \
-q 28 \